Articles, Projects, and Links

This is a list of multimodal projects, articles written about them, and web pages pointing to them. Please feel free to add projects you know of, or your own research or applications, to this list.

THIS PAGE IS NOW OBSOLETE


References:

P.S. Aleksic and A.K. Katsaggelos. Comparison of MPEG-4 facial animation parameter groups with respect to audio-visual speech recognition performance. In IEEE International Conference on Image Processing 2005, volume III, pages 501-504, 2005.

J.T. Jiang, A. Alwan, P.A. Keating, E.T. Auer Jr., and L.E. Bernstein. On the relationship between face movements, tongue movements, and speech acoustics. EURASIP Journal on Advances in Signal Processing, 2002(11):1174-, November 2002.

D. Sodoyer, J.L. Schwartz, L. Girin, J. Klinkisch, and C. Jutten. Separation of audio-visual speech sources: A new approach exploiting the audio-visual coherence of speech stimuli. EURASIP Journal on Advances in Signal Processing, 2002(11):1165-, November 2002.

M. Heckmann, F. Berthommier, and K. Kroschel. Noise adaptive stream weighting in audio-visual speech recognition. EURASIP Journal on Advances in Signal Processing, 2002(11):1260-, November 2002.

A.V. Nefian, L.H. Liang, X.B. Pi, X.X. Liu, and K. Murphy. Dynamic Bayesian networks for audio-visual speech recognition. EURASIP Journal on Advances in Signal Processing, 2002(11):1274-, November 2002.

S. Gurbuz, E.K. Patterson, Z. Tufekci, and J.N. Gowdy. Affine-invariant visual features contain supplementary information to enhance speech recognition. In IAPR International Conference on Audio- and Video-based Biometric Person Authentication 2001, pages 175-, 2001.

G.A. Kalberer, P. Muller, and L.J. Van Gool. Visual speech, a trajectory in viseme space. International Journal of Imaging Systems and Technology, 13(1):74-84, 2003.

G. Potamianos, C. Neti, G. Gravier, A. Garg, and A.W. Senior. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9):1306-1326, September 2003.

M.N. Kaynak, Q. Zhi, A.D. Cheok, K. Sengupta, Z. Jian, and K.C. Chung. Analysis of lip geometric features for audio-visual speech recognition. IEEE Transactions on Systems, Man and Cybernetics, Part A, 34(4):564-570, July 2004.

S.W. Foo, Y. Lian, and L. Dong. Recognition of visual speech elements using adaptively boosted hidden Markov models. IEEE Transactions on Circuits and Systems for Video Technology, 14(5):693-705, May 2004.

L.H. Terry, D.J. Shiell, and A.K. Katsaggelos. Feature space video stream consistency estimation for dynamic stream weighting in audio-visual speech recognition. In IEEE International Conference on Image Processing 2008, pages 1316-1319, 2008.

S. Pachoud, S. Gong, and A. Cavallaro. Video augmentation for improving audio speech recognition under noise. In British Machine Vision Conference 2008, 2008.

K. Saenko, K. Livescu, M. Siracusa, K. Wilson, J. Glass, and T.J. Darrell. Visual speech recognition with loosely synchronized feature streams. In IEEE International Conference on Computer Vision 2005, volume II, pages 1424-1431, 2005.

J. Kratt, F. Metze, R. Stiefelhagen, and A. Waibel. Large vocabulary audio-visual speech recognition using the Janus speech recognition toolkit. In Annual Symposium of the German Association for Pattern Recognition 2004, pages 488-495, 2004.

S. Kettebekov, M. Yeasin, and R. Sharma. Improving continuous gesture recognition with spoken prosody. In IEEE CS International Conference on Computer Vision and Pattern Recognition 2003, volume I, pages 565-570, 2003.

X.Z. Zhang, R.M. Mersereau, and M. Clements. Bimodal fusion in audio-visual speech recognition. In IEEE International Conference on Image Processing 2002, volume I, pages 964-967, 2002.

D.N. Zotkin, R. Duraiswami, and L.S. Davis. Joint audio-visual tracking using particle filters. EURASIP Journal on Advances in Signal Processing, 2002(11):1154-, November 2002.

E.K. Patterson, S. Gurbuz, Z. Tufekci, and J.N. Gowdy. Moving-talker, speaker-independent feature study, and baseline results using the CUAVE multimodal speech corpus. EURASIP Journal on Advances in Signal Processing, 2002(11):1189-, November 2002.

A. Garg, V. Pavlovic, and J.M. Rehg. Boosted learning in dynamic Bayesian networks for multimodal speaker detection. Proceedings of the IEEE, 91(9):1355-1369, September 2003.

A. Garg, V. Pavlovic, and J.M. Rehg. Audio-visual speaker detection using dynamic Bayesian networks. In IEEE CS International Conference on Computer Vision and Pattern Recognition 2000, pages 384-390, 2000.

V. Pavlovic, A. Garg, J.M. Rehg, and T.S. Huang. Multimodal speaker detection using error feedback dynamic Bayesian networks. In IEEE CS International Conference on Computer Vision and Pattern Recognition 2000, volume II, pages 34-41, 2000.

T. Choudhury, J.M. Rehg, V. Pavlovic, and A.P. Pentland. Boosting and structure learning in dynamic Bayesian networks for audio-visual speaker detection. In IAPR International Conference on Pattern Recognition 2002, volume II, pages 789-794, 2002.

V. Pavlovic. Multimodal tracking and classification of audio-visual features. In IEEE International Conference on Image Processing 1998, volume I, pages 343-347, 1998.

J.M. Rehg, K.P. Murphy, and P.W. Fieguth. Vision-based speaker detection using Bayesian networks. In IEEE CS International Conference on Computer Vision and Pattern Recognition 1999, volume II, pages 110-116, 1999.

F. Talantzis, A. Pnevmatikakis, and A.G. Constantinides. Audio-visual active speaker tracking in cluttered indoors environments. IEEE Transactions on Systems, Man and Cybernetics, Part B, 37(3):799-807, June 2007.

H. Vajaria, R. Sankar, and R. Kasturi. Exploring co-occurrence between speech and body movement for audio-guided video localization. IEEE Transactions on Circuits and Systems for Video Technology, 18(11):1608-1617, November 2008.

H. Vajaria, T. Islam, S. Sarkar, R. Sankar, and R. Kasturi. Audio segmentation and speaker localization in meeting videos. In IAPR International Conference on Pattern Recognition 2006, volume II, pages 1150-1153, 2006.

H. Hung and G. Friedland. Towards audio-visual on-line diarization of participants in group meetings. In ECCV Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications, 2008.

Y. Liu and Y. Sato. Finding speaker face region by audiovisual correlation. In ECCV Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications, 2008.

D. Kelly, F. Pitie, A. Kokaram, and F. Boland. A comparative error analysis of audio-visual source localization. In ECCV Workshop on Multi-camera and Multi-modal Sensor Fusion Algorithms and Applications, 2008.

N. Katsarakis, F. Talantzis, A. Pnevmatikakis, and L. Polymenakos. The AIT 3D audio/visual person tracker for CLEAR 2007. In Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007, 2007.

Y. Horii, H. Kawashima, and T. Matsuyama. Speaker detection using the timing structure of lip motion and sound. In First IEEE Workshop on CVPR for Human Communicative Behavior Analysis, pages 1-8, 2008.

O. Ikeda. Detection of a speaker in video by combined analysis of speech sound and mouth movement. In International Symposium on Visual Computing 2007, volume II, pages 602-610, 2007.

A. O'Donovan, R. Duraiswami, and J. Neumann. Microphone arrays as generalized cameras for integrated audio visual processing. In IEEE CS International Conference on Computer Vision and Pattern Recognition 2007, pages 1-8, 2007.

J. Abbas, C.K. Dagli, and T.S. Huang. A multimodality framework for creating speaker/non-speaker profile databases for real-world video. In Semantic Learning Applications in Multimedia 2007 (CVPR Workshop), pages 1-8, 2007.

A. Kushal, M. Rahurkar, L. Fei Fei, J. Ponce, and T. Huang. Audio-visual speaker localization using graphical models. In IAPR International Conference on Pattern Recognition 2006, volume II, pages 291-294, 2006.

T. Tsuji, K. Yamamoto, and I. Ishii. Real-time sound source localization based on audiovisual frequency integration. In IAPR International Conference on Pattern Recognition 2006, volume IV, pages 322-325, 2006.

G. Monaci and P. Vandergheynst. Audiovisual gestalts. In CVPR Workshop on Perceptual Organization in Computer Vision, page 200, 2006.

Z.G. Zhu, W.H. Li, E. Molina, and G. Wolberg. LDV sensing and processing for remote hearing in a multimodal surveillance system. In IEEE CS Conference on Computer Vision and Pattern Recognition 2007, pages 1-2, 2007.

Z.G. Zhu, W.H. Li, and G. Wolberg. Integrating LDV audio and IR video for remote multimodal surveillance. In IEEE CS Conference on Computer Vision and Pattern Recognition 2005, volume III, pages 10-, 2005.

N. Megherbi, S. Ambellouis, O. Colot, and F. Cabestaing. Data association in multi-target tracking using belief theory: Handling target emergence and disappearance issue. In IEEE Conference on Advanced Video and Signal Based Surveillance 2005, pages 517-521, 2005.

N. Megherbi, S. Ambellouis, O. Colot, and F. Cabestaing. Joint audio-video people tracking using belief theory. In IEEE Conference on Advanced Video and Signal Based Surveillance 2005, pages 135-140, 2005.

X. Li, L. Sun, L.M. Tao, G.Y. Xu, and Y. Jia. A speaker tracking algorithm based on audio and visual information fusion using particle filter. In International Conference on Image Analysis and Recognition 2004, volume II, pages 572-580, 2004.

A. Blake, M. Gangnet, P. Perez, and J. Vermaak. Integrated tracking with vision and sound. In International Conference on Image Analysis and Processing 2001, pages 354-357, 2001.

A.V. Nefian, L.H. Liang, T.Y. Fu, and X.X. Liu. A Bayesian approach to audio-visual speaker identification. In IAPR International Conference on Audio- and Video-based Biometric Person Authentication 2003, pages 761-769, 2003.

A. Albiol, L. Torres, and E.J. Delp. Fully automatic face recognition system using a combined audio-visual approach. IEE Proceedings on Vision, Image and Signal Processing, 152(3):318-326, June 2005.

S. Palanivel and B. Yegnanarayana. Multimodal person authentication using speech, face and visual speech. Computer Vision and Image Understanding, 109(1):44-55, January 2008.

G. Chetty and M. Wagner. Robust face-voice based speaker identity verification using multilevel fusion. Image and Vision Computing, 26(9):1249-1260, September 2008.

G. Chetty and M. Wagner. Audio visual speaker verification based on hybrid fusion of cross modal features. In Lecture Notes in Computer Science, volume 4815, pages 469-478, 2007.

I. Naseem and A.S. Mian. User verification by combining speech and face biometrics in video. In International Symposium on Visual Computing 2008, volume II, pages 482-492, 2008.

E.A. Rua, J.L.A. Castro, and C.G. Mateo. Quality-based score normalization for audiovisual person authentication. In International Conference on Image Analysis and Recognition 2008, 2008.

A. Das. Audio visual person authentication by multiple nearest neighbor classifiers. In Lecture Notes in Computer Science, volume 4642, pages 1114-1123, 2007.

G. Chetty and M. Wagner. Face-voice authentication based on 3D face models. In Lecture Notes in Computer Science, volume 3851, pages 559-568, 2006.

Z.Y. Wu, L.H. Cai, and H. Meng. Multi-level fusion of audio and visual features for speaker identification. In Lecture Notes in Computer Science, volume 3832, pages 493-499, 2005.

P. Yang, Y.C. Yang, and Z.H. Wu. Exploiting glottal information in speaker recognition using parallel GMMs. In IAPR International Conference on Audio- and Video-based Biometric Person Authentication 2005, pages 804-, 2005.

Z. Lei, Y.C. Yang, and Z.H. Wu. A UBM-based reference space for speaker recognition. In IAPR International Conference on Pattern Recognition 2006, volume IV, pages 318-321, 2006.

D.D. Li, Y.C. Yang, and Z.H. Wu. Dynamic Bayesian networks for audio-visual speaker recognition. In Lecture Notes in Computer Science, volume 3832, pages 539-545, 2005.

V. Pavlovic, G. Berry, and T.S. Huang. Integration of audio/visual information for use in human-computer intelligent interaction. In IEEE International Conference on Image Processing 1997, volume I, pages 121-124, 1997.

N.A. Fox, B.A. O'Mullane, and R.B. Reilly. Audio-visual speaker identification via adaptive fusion using reliability estimates of both modalities. In IAPR International Conference on Audio- and Video-based Biometric Person Authentication 2005, pages 787-, 2005.

D. Zhang, A. Ghobakhlou, and N. Kasabov. An adaptive model of person identification combining speech and image information. In International Conference on Control, Automation, Robotics and Vision 2004, volume I, pages 413-418, 2004.

N.A. Fox and R.B. Reilly. Audio-visual speaker identification based on the use of dynamic audio and visual features. In IAPR International Conference on Audio- and Video-based Biometric Person Authentication 2003, pages 743-751, 2003.

J. Czyz, S. Bengio, C. Marcel, and L. Vandendorpe. Scalability analysis of audio-visual person identity verification. In IAPR International Conference on Audio- and Video-based Biometric Person Authentication 2003, pages 752-760, 2003.

S. Bengio. Multimodal authentication using asynchronous HMMs. In IAPR International Conference on Audio- and Video-based Biometric Person Authentication 2003, pages 770-777, 2003.

S. Lucey and T.H. Chen. Improved audio-visual speaker recognition via the use of a hybrid combination strategy. In IAPR International Conference on Audio- and Video-based Biometric Person Authentication 2003, pages 929-936, 2003.

N. Poh and J. Korczak. Hybrid biometric person authentication using face and voice features. In IAPR International Conference on Audio- and Video-based Biometric Person Authentication 2001, pages 348-, 2001.

J.E. Higgins and R.I. Damper. An HMM-based subband processing approach to speaker identification. In IAPR International Conference on Audio- and Video-based Biometric Person Authentication 2001, pages 169-, 2001.

R. Sharma, M. Yeasin, N. Krahnstoever, I. Rauschert, G. Cai, I. Brewer, A.M. MacEachren, and K. Sengupta. Speech-gesture driven multimodal interfaces for crisis management. Proceedings of the IEEE, 91(9):1327-1354, September 2003.

J. Kleindienst, T. Macek, L. Seredi, and J. Sedivy. Interaction framework for home environment using speech and vision. Image and Vision Computing, 25(12):1836-1847, December 2007.

J. Kleindienst, T. Macek, L. Seredi, and J. Sedivy. Djinn: Interaction framework for home environment using speech and vision. In Lecture Notes in Computer Science, volume 3058, pages 153-164, 2004.

L. Wang, D. Tjondrongoro, and Y. Liu. Clustering and visualizing audio-visual dataset on mobile devices in a topic-oriented manner. In Lecture Notes in Computer Science, volume 4781, pages 310-321.

D. Stodle, J.M. Bjorndalen, and O.J. Anshus. A system for hybrid vision- and sound-based interaction with distal and proximal targets on wall-sized, high-resolution tiled displays. In Lecture Notes in Computer Science, volume 4796, pages 59-68.

T. Hermann, T. Henning, and H. Ritter. Gesture desk: An integrated multi-modal gestural workplace for sonification. In Lecture Notes in Computer Science, volume 2915, pages 369-379, 2004.

G. Merola and I. Poggi. Multimodality and gestures in the teacher's communication. In Lecture Notes in Computer Science, volume 2915, pages 101-111, 2004.

F. Althoff, G. McGlaun, M. Lang, and G. Rigoll. Evaluating multimodal interaction patterns in various application scenarios. In Lecture Notes in Computer Science, volume 2915, pages 421-435, 2004.

A. Kranstedt, P. Kuhnlein, and I. Wachsmuth. Deixis in multimodal human computer interaction: An interdisciplinary approach. In Lecture Notes in Computer Science, volume 2915, pages 112-123, 2004.

N. Krahnstoever, E. Schapira, S. Kettebekov, and R. Sharma. Multimodal human-computer interaction for crisis management systems. In Sixth IEEE Workshop on Applications of Computer Vision 2002, pages 203-207, 2002.

M. Delakis, G. Gravier, and P. Gros. Audiovisual integration with segment models for tennis video parsing. Computer Vision and Image Understanding, 111(2):142-154, August 2008.

T.M. Hospedales and S. Vijayakumar. Structure inference for bayesian multisensory scene understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(12):2140-2157, December 2008.

W. Zajdel, J.D. Krijnders, T. Andringa, and D.M. Gavrila. Cassandra: Audio-video sensor fusion for aggression detection. In IEEE Conference on Advanced Video and Signal Based Surveillance 2007, pages 200-205, 2007.

P.W.J. van Hengel and T.C. Andringa. Verbal aggression detection in complex social environments. In IEEE Conference on Advanced Video and Signal Based Surveillance 2007, pages 15-20, 2007.

L. Xin, J.H. Tao, and T.N. Tan. Dynamic audio-visual mapping using fused hidden Markov model inversion method. In IEEE International Conference on Image Processing 2007, volume III, pages 293-296, 2007.

A.L. Casanovas, G. Monaci, and P. Vandergheynst. Blind audiovisual source separation using sparse representations. In IEEE International Conference on Image Processing 2007, volume III, pages 301-304, 2007.

S. Nakamura. Fusion of audio-visual information for integrated speech processing. In IAPR International Conference on Audio- and Video-based Biometric Person Authentication 2001, pages 127-, 2001.

N.A. Fox, B.A. O'Mullane, and R.B. Reilly. VALID: A new practical audio-visual database, and comparative results. In IAPR International Conference on Audio- and Video-based Biometric Person Authentication 2005, page 777, 2005.
